ABSTRACT


In this paper we present preliminary results obtained at Dragon Systems on
the Resource Management benchmark task. The basic conceptual units of
our system are Phonemes-in-Context (PICs), which are represented as
Hidden Markov Models, each of which is expressed as a sequence of
Phonetic Elements (PELs). The PELs corresponding to a given phoneme
constitute a kind of alphabet for the representation of PICs.

For the speaker-dependent tests, two basic methods of training the acoustic
models were investigated. The first method of training the Resource
Management models is to re-estimate the models for each test speaker from
that speaker’s training data, keeping the PEL spellings of the PICs fixed. The
second approach is to use the re-estimated models from the first method to
derive a segmentation of the training data, then to respell the PICs in a largely
speaker-dependent manner in order to improve the representation of speaker
differences. A full explanation of these methods is given, as are results using
each method.

In addition to reporting on two different training strategies, we discuss
N-Best results. The N-Best algorithm is a modification of the algorithm
proposed by Soong and Huang at the June 1990 workshop. This algorithm
runs as a post-processing step and uses an A*-search (an algorithm also
known as a “stack decoder”).


1.  INTRODUCTION 

In this paper we report on some preliminary work done at
Dragon Systems¹ on the Resource Management benchmark
task. First, a brief overview of Dragon Systems’ speaker-dependent,
continuous speech recognition system is given.
Next, the modifications necessary to evaluate this system on
the RM task are described. Our goal has been to make changes
to the standard continuous speech recognition system in ways
that are in line with Dragon’s long term aims. The primary
modifications so far have been in the areas of signal processing
and speaker-dependent training. The speaker-dependent
training is described in detail in Section 4.


Recognition results are given for the RM1 speaker-dependent
development test data and for the Feb91 evaluation
test material. In presenting these results, we make a start at
evaluating the transfer characteristics of our system when
responding to changes in the speaker, the hardware, and the
signal processing algorithm. Our experimentation was
performed using the speaker-dependent development test
data, and these data are used to compare system configurations
in this paper. Since we believe that we are still on a steep
learning curve, the February 1991 evaluation test material was
run through the system only one time, and thus comparative
results using the evaluation data are not yet available.

2.  OVERVIEW  OF  THE  DRAGON 
CSR  SYSTEM 

Dragon Systems’ continuous speech recognition system
was presented at the June 1990 DARPA meeting [1,2,3]. The
system is speaker-dependent and was demonstrated to be
capable of near real-time performance on an 844 word task
(mammography reports), when running on a 486-based PC.
The signal processing is performed by an additional
TMS32010-based board. The speech is sampled at 12 kHz and
the signal representation is quite simple: there are only eight
parameters (7 spectral components covering the region up
to 3 kHz and an overall energy parameter), a complete set of
which is computed every 20 ms and used as input to the
HMM-based recognizer.
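
As a rough illustration of a front end of this shape, the following Python sketch computes eight parameters per 20 ms frame from 12 kHz samples. The band layout (seven equal-width bands below 3 kHz), the use of a plain DFT, the log compression, and the name `frame_features` are all assumptions made for illustration; the paper does not specify how the spectral components are derived.

```python
import cmath
import math

def frame_features(samples, rate=12000, frame_ms=20, n_bands=7, max_hz=3000.0):
    """Return one list of n_bands + 1 values per non-overlapping frame.

    Assumptions (not from the paper): equal-width bands, a plain DFT,
    and log-compressed energies.
    """
    frame_len = rate * frame_ms // 1000          # 240 samples at 12 kHz
    band_width = max_hz / n_bands
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        bands = [0.0] * n_bands
        # Accumulate DFT power in each band below max_hz (bin 0 is skipped).
        for k in range(1, frame_len // 2):
            hz = k * rate / frame_len
            if hz >= max_hz:
                break
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            bands[int(hz // band_width)] += abs(coeff) ** 2
        energy = sum(x * x for x in frame)
        floor = 1e-10                            # avoids log(0) on silence
        features.append([math.log(b + floor) for b in bands]
                        + [math.log(energy + floor)])
    return features
```

At these settings each frame spans 240 samples, so the recognizer would receive 50 frames of eight parameters per second of speech.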

The fundamental conceptual unit used in the system is the
“phoneme-in-context” or PIC, where the word “context” in


1. This work was sponsored by the Defense Advanced Research Projects Agency and was monitored by the Space and Naval
Warfare Systems Command under Contract N00039-86-C-0307.




Report Documentation Page (Standard Form 298)
Title: Dragon Systems Resource Management Benchmark Results - February 1991
Report date: 1991
Performing organization: Dragon Systems Inc, 320 Nevada Street, Newton, MA, 02160
Distribution: Approved for public release; distribution unlimited
Security classification: unclassified
Number of pages: 6


principle refers to as much information about the surrounding
phonetic environment as is necessary to determine the acoustic
character of the phoneme in question. Several related
alternative approaches have appeared in the literature [5,6,7].
Currently, context for our models includes the identity of the
preceding and succeeding phonemes as well as whether the
phoneme is in a prepausally lengthened segment. PICs are
modeled as a sequence of PELs (phonetic elements), each of
which represents a “state” in an HMM. PELs may be shared
among PIC models representing the same phoneme. A
detailed description of models for PICs and how they are
trained may be found in [2]. Modifications made to the PIC
training procedure are presented in Section 4.
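
The relationship between PICs and PELs can be pictured with a small Python sketch. The phoneme symbols, PEL names, and spellings below are invented for illustration, and the names `PIC`, `SPELLINGS`, and `state_sequence` are hypothetical; only the structure (a PIC keyed by its context, spelled as a sequence of PELs from its own phoneme's alphabet) follows the description above.

```python
from typing import Dict, NamedTuple, Tuple

class PIC(NamedTuple):
    """A phoneme together with the context that shapes its acoustics."""
    left: str          # preceding phoneme
    phoneme: str       # the phoneme being modeled
    right: str         # succeeding phoneme
    prepausal: bool    # True inside a prepausally lengthened segment

# Hypothetical PEL spellings: each PIC maps to a sequence of PEL names.
# PELs are drawn from the alphabet of the PIC's own phoneme, so "ae.2"
# can be shared by several PICs of the phoneme "ae".
SPELLINGS: Dict[PIC, Tuple[str, ...]] = {
    PIC("k", "ae", "t", False): ("ae.1", "ae.2", "ae.5"),
    PIC("b", "ae", "t", False): ("ae.3", "ae.2", "ae.5"),
    PIC("#", "k", "ae", False): ("k.1", "k.4"),
}

def state_sequence(pics):
    """Concatenate PEL spellings to get the HMM state sequence for a PIC string."""
    states = []
    for pic in pics:
        states.extend(SPELLINGS[pic])
    return states
```

Keeping the spelling separate from the PEL parameters is what lets the two training methods in Section 4 differ: one re-estimates the PELs only, the other also changes the spellings.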

Recognition uses frame-synchronous dynamic
programming to extend the sentence hypotheses subject to the
beam pruning used to eliminate poor paths. Another important
component of the system is the rapid matcher, described in [3],
which limits the number of word candidates that can be
hypothesized to start at any given frame. Some alternative
approaches to the rapid match problem have also been outlined
by others [8,9,10].
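
A minimal Python sketch of frame-synchronous dynamic programming with beam pruning follows. It is a generic Viterbi beam search over an arbitrary state graph, not Dragon's recognizer: the rapid matcher, word-level structure, and duration models are left out, and the function name, the `beam` width, and the data layout are assumptions.

```python
import math

def beam_viterbi(frames, transitions, emit_logp, start, beam=10.0):
    """Frame-synchronous Viterbi search with beam pruning.

    transitions: dict state -> list of (next_state, log_prob)
    emit_logp:   function (state, frame) -> log likelihood
    Returns (best_log_score, best_state_path).
    """
    # Active hypotheses: state -> (score, path). Only the start state is live.
    active = {start: (0.0, [])}
    for frame in frames:
        extended = {}
        for state, (score, path) in active.items():
            for nxt, trans_lp in transitions.get(state, []):
                cand = score + trans_lp + emit_logp(nxt, frame)
                # Keep only the best path into each state (Viterbi recombination).
                if nxt not in extended or cand > extended[nxt][0]:
                    extended[nxt] = (cand, path + [nxt])
        if not extended:
            return -math.inf, []
        # Beam pruning: drop hypotheses scoring far below the current best.
        best = max(score for score, _ in extended.values())
        active = {s: h for s, h in extended.items() if h[0] >= best - beam}
    return max(active.values(), key=lambda h: h[0])
```

A narrower beam reduces computation at the risk of pruning the path that would eventually have won, which is the same trade-off the rapid match settings control at the word level.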

3.  MODIFICATIONS  TO  THE  SYSTEM  FOR 
USE  WITH  THE  RM  TASK 

In order to be able to run the RM benchmark task on the
Dragon speaker-dependent continuous speech recognition
system, several modifications were necessary. These
modifications primarily concerned the signal acquisition and
preprocessing stages. Prior to this evaluation, the system had
only been evaluated on data obtained from Dragon’s own
acquisition hardware.

The signal processing, as described above, has always
been performed by the signal acquisition board. Thus it was
thought possible that the performance of the system would be
highly tuned to the hardware. In order to run the RM data
through the system, software was written to emulate the
hardware. One question to be addressed is how well the signal
processing software does in fact emulate the hardware. To
assess this, a small test was performed using new data from
Dragon’s reference speaker. The speaker recorded, using the
Dragon hardware, three sets of 100 sentences selected from
the development test texts (those of BEF, CMR, and DAS).
Recognition was performed, using the reference speaker’s
base models after adapting to the standard training sentences,
and an average word error rate of 3.5% was recorded. The fact
that the rate is comparable to error rates of some of the better
RM1 speakers suggests that we have emulated our standard
signal processing reasonably well. An explicit comparison of
performance on the reference speaker using our standard
hardware and our software emulation will be available soon.


A lexicon for the RM task had to be specified before
models could be built. Pronunciations were supplied for each
entry in the SNOR lexicon by extracting them from our
standard lexicon. Any entries not found in Dragon’s current
general English lexicon were added by hand. The set of
phonemes used for English contains 24 consonants, 17 vowels
(each of which may have 3 degrees of stress), and 3 syllabic
consonants. Approximately 22% of the entries in the SNOR
lexicon have been given multiple pronunciations. These
pronunciations may reflect stress differences, such as stressed
and unstressed versions of function words, and expected
pronunciation alternatives.

Roughly 30,000 PICs are used in modeling the vocabulary
for this task. The set of PICs was determined by finding all of
the PICs that can occur given the constraint that sentences
must conform to the word pair grammar. The training data
used to build PIC models for the reference speaker comes
primarily from general English isolated words and phrases,
supplemented by a few hundred phrases from the RM1
training sentences. The generation and training of PICs is
discussed in more detail in the next section.
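
One way such an enumeration could be carried out is sketched below in Python. It is a simplification under stated assumptions: each word has a single pronunciation, prepausal lengthening is ignored, and a word receives the boundary symbol as context only when the grammar gives it no predecessor or successor. The function name `legal_pics` and the toy lexicon used to exercise it are hypothetical.

```python
def legal_pics(lexicon, word_pairs, boundary="#"):
    """Enumerate (left, phoneme, right) triples allowed by a word-pair grammar.

    lexicon:    dict word -> list of phonemes (one pronunciation per word)
    word_pairs: set of (first_word, second_word) allowed by the grammar
    """
    pics = set()
    for word, phones in lexicon.items():
        # Phonemes that can precede or follow this word across a word boundary.
        lefts = {lexicon[a][-1] for a, b in word_pairs if b == word} or {boundary}
        rights = {lexicon[b][0] for a, b in word_pairs if a == word} or {boundary}
        last = len(phones) - 1
        for i, phone in enumerate(phones):
            left_ctx = lefts if i == 0 else {phones[i - 1]}
            right_ctx = rights if i == last else {phones[i + 1]}
            for left in left_ctx:
                for right in right_ctx:
                    pics.add((left, phone, right))
    return pics
```

Because word-internal contexts are fixed by the pronunciation, the count is driven mostly by the cross-word contexts that the grammar permits.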

The language model used in the CSR system returns a log
probability indicating the score of the candidate word. This
was modified to return a fixed score if the word is allowed by
the word-pair grammar or a flag denoting that the sequence is
impermissible.
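
The modified language model interface can be sketched in a few lines of Python. The constant value, the use of `None` as the flag, and the function name are assumptions; the paper states only that a fixed score or an "impermissible" flag is returned.

```python
FIXED_SCORE = 0.0        # hypothetical constant score for any permitted word
IMPERMISSIBLE = None     # hypothetical flag for a disallowed sequence

def word_pair_score(previous_word, candidate, allowed_pairs):
    """Return a fixed score if the grammar allows the pair, else a flag."""
    if (previous_word, candidate) in allowed_pairs:
        return FIXED_SCORE
    return IMPERMISSIBLE
```

With a fixed score, the grammar acts purely as a constraint on which word sequences may be hypothesized and contributes nothing to ranking the permitted ones.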

The standard rapid match module was used in all of the
experiments reported in this paper, in order to reduce processing
time. We have not focused on the issue of processing time in
the current phase of our research, and have therefore modified
our standard rapid match parameter settings to be sufficiently
conservative so as to ensure that only a small proportion of the
errors are due to rapid match mistakes.

4.  TRAINING  ALGORITHMS  FOR  THE 
SPEAKER-DEPENDENT  MODELS 

Dragon’s strategy for phoneme-based training was
described in detail in an earlier report [2]. We have used a fully
automatic version of the same strategy to build speaker-dependent
models for each of the RM1 speakers, using the
reference speaker’s models to provide an initial segmentation.
The goal was to build models in which the acoustic parameters
and duration estimates were based almost entirely on the 600
training utterances for each speaker, using the reference
speaker’s models only in rare cases for which no relevant
training data is available.

The recognition model for a word (or sentence) is obtained
by concatenating a sequence of PICs, each of which is, in turn,
[...] were selected in the course of the semi-automatic labeling of
a large amount of data acquired from the reference speaker,
about 9000 isolated words and 6000 short phrases. In changing
to the Resource Management task, an additional set of
task-specific training utterances from the reference speaker was
added. Although less than 10% of the training data was drawn
from the Resource Management task, most of the PICs that are
legal according to the word-pair grammar are represented
somewhere in the total training set. Legal PICs missing from
the training set are typically like the sequence “ah-uh-ee” that
would occur in “WICHITA A EAST”: for the most part, they
do not occur in the training sentences and seem unlikely to
occur in evaluation sentences.

The reference speaker’s models are speaker-dependent in
three distinct ways:

1. The parameters of the PELs depend on the spectral
characteristics of the reference speaker’s voice.

2. The durations for the PELs in each Markov model for
a PIC depend on the reference speaker’s speaking rate
and other features of his speech.

3. The sequence of PELs used in the Markov model for
a PIC depends on what allophone the reference speaker
uses in a given context.

We report on two techniques for creating speaker-dependent
PICs starting with the reference speaker’s models.
The first is a straightforward adaptation algorithm, in which a
new speaker’s training utterances are segmented into PICs
and PELs using a set of base models, and the segments are then
used to re-estimate the parameters of the PELs and of the
duration models. This algorithm is typically run multiple
times. This approach is very effective in dealing with (1),
since the 600 training sentences include data for almost all of
the PELs. This strategy is less effective in dealing with (2),
since only about 6000 of the 30000 PICs occur in the training
scripts. Adaptation alone, however, can do nothing to change
(3) the “spelling” of each PIC in terms of PELs.
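
The re-estimation half of one adaptation pass might look like the following Python sketch. It assumes the segmentation has already assigned every training frame to a PEL, and it updates only PEL means by a plain sample average; the actual update rule, the treatment of variances, and the duration re-estimation are not given in the paper, so all of these choices (and the name `reestimate_pel_means`) are illustrative.

```python
from collections import defaultdict

def reestimate_pel_means(aligned_frames, base_means):
    """One re-estimation step for PEL means.

    aligned_frames: list of (pel_name, feature_vector) pairs from a segmentation
    base_means:     dict pel_name -> mean vector from the base models
    PELs with no aligned frames keep their base-model mean.
    """
    sums = {}
    counts = defaultdict(int)
    for pel, vec in aligned_frames:
        if pel not in sums:
            sums[pel] = [0.0] * len(vec)
        sums[pel] = [s + v for s, v in zip(sums[pel], vec)]
        counts[pel] += 1
    new_means = dict(base_means)
    for pel, total in sums.items():
        new_means[pel] = [s / counts[pel] for s in total]
    return new_means
```

Running several passes means repeating the cycle of segmenting with the current models and then re-estimating from the new segments.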

The  first  technique  uses  the  following  two  steps: 

Step 1: The data from all 12 of the speakers were used to
adapt the reference speaker’s models. Three passes of
adaptation were performed with these data. Since Dragon’s
algorithm does not yet use mixture distributions, this has the
effect of averaging together spectra for male and female
talkers and generally “washing out” formants in PELs for
vowels. The resulting “multiple speaker” models are not good
enough to do speaker-independent recognition, but they serve
as a better basis for speaker adaptation than do the reference
speaker’s models.


Step 2: For a given speaker, a maximum of six passes of
adaptation are carried out, starting from the multiple-speaker
models. The resulting models are used to segment the utterances
into phonemes. At this point we have a good speaker-dependent
set of PEL models, and a set of segmentations with
which to proceed further.


The second technique begins with the models produced by
the first technique together with the segmentation of the
training data into phonemes done using those same models.
Using this automatic labeling, speaker-dependent training is
performed for each of the RM1 speakers, to produce a new
speaker-dependent set of PIC models, with new PEL
spellings and duration models. The algorithm is as follows:


Step 1: For each phoneme in turn, all the labeled training
data for that phoneme are extracted from the training sentences.
For each PIC that involves the phoneme, an appropriate
weighted average of these data is taken to create a spectral
model (a sequence of expected values for each frame) for the
PIC. Details of this averaging process may be found in our
earlier report [2], but the key idea is to take a weighted average
of phoneme tokens that represent the PIC to be modeled or
closely related PICs.
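
The key idea of Step 1 can be sketched in Python as follows. The weighting scheme (one unit of weight per matching context side), the assumption that tokens have already been normalized to a common number of frames, the `min_weight` threshold, and the name `spectral_model` are all hypothetical; the actual averaging procedure is the one described in [2].

```python
def spectral_model(target, tokens, min_weight=1.0):
    """Weighted frame-by-frame average of phoneme tokens for one PIC.

    target: (left, phoneme, right) of the PIC to be modeled
    tokens: list of ((left, phoneme, right), frames), where frames is a
            list of feature vectors already normalized to a common length
    Returns None when the related data are insufficient.
    """
    left, phoneme, right = target
    total_weight = 0.0
    acc = None
    for (tok_left, tok_phoneme, tok_right), frames in tokens:
        if tok_phoneme != phoneme:
            continue
        # Hypothetical weights: 1.0 per matching context side.
        weight = float(tok_left == left) + float(tok_right == right)
        if weight == 0.0:
            continue
        if acc is None:
            acc = [[0.0] * len(f) for f in frames]
        for t, frame in enumerate(frames):
            for d, value in enumerate(frame):
                acc[t][d] += weight * value
        total_weight += weight
    if acc is None or total_weight < min_weight:
        return None
    return [[v / total_weight for v in frame] for frame in acc]
```

Returning `None` corresponds to the case in which no sufficiently related tokens exist, so that the PIC has to fall back on an adapted model instead.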

The number of PICs to be constructed for each phoneme
is of the same order of magnitude as the number of examples
of the phoneme in the 600 training sentences. Since there are
examples of only about 6000 PICs in the RM1 training
sentences, for most PICs the models must be based entirely on
data with either the left or right context incorrect. For about
one-fifth of the 30000 PICs, there were insufficient related
data to construct a spectral model (using the usual criteria for
“relatedness”). This is frequently the case when a diphone
corresponding to a legal word pair fails to occur in the training
sentences.

Step 2: Dynamic programming is used to construct the
sequence of PELs that best represents the spectral model for
each PIC, thereby “respelling” the PIC in terms of PELs. This
results in a speaker-dependent PEL spelling for each PIC. In
the process, speaker-dependent durations for each PEL in a
PIC are also computed.
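
A respelling step of this kind could be sketched in Python as a small dynamic program. This is an illustration under assumptions, not the procedure of [2]: the distance is squared Euclidean, a constant `switch_penalty` discourages needlessly long spellings, and durations are simply run lengths in frames.

```python
def respell(model_frames, pel_means, switch_penalty=1.0):
    """Label each frame with a PEL by dynamic programming, then collapse runs.

    model_frames:   list of feature vectors (the PIC's spectral model)
    pel_means:      dict pel_name -> mean vector (the phoneme's PEL alphabet)
    switch_penalty: hypothetical cost for starting a new PEL
    Returns a list of (pel_name, duration_in_frames).
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    pels = list(pel_means)
    cost = {p: dist(model_frames[0], pel_means[p]) for p in pels}
    back = []
    for frame in model_frames[1:]:
        best_prev = min(pels, key=lambda p: cost[p])
        new_cost, pointers = {}, {}
        for p in pels:
            # Either stay in the same PEL or switch from the cheapest one.
            if cost[p] <= cost[best_prev] + switch_penalty:
                prev, base = p, cost[p]
            else:
                prev, base = best_prev, cost[best_prev] + switch_penalty
            new_cost[p] = base + dist(frame, pel_means[p])
            pointers[p] = prev
        cost = new_cost
        back.append(pointers)
    # Trace back the cheapest labeling, then merge repeated labels.
    label = min(pels, key=lambda p: cost[p])
    labels = [label]
    for pointers in reversed(back):
        label = pointers[label]
        labels.append(label)
    labels.reverse()
    spelling = []
    for label in labels:
        if spelling and spelling[-1][0] == label:
            spelling[-1] = (label, spelling[-1][1] + 1)
        else:
            spelling.append((label, 1))
    return spelling
```

The returned run lengths give a per-PEL duration as a by-product of the same pass that chooses the spelling.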

Step 3: Step 2 results in respelled PICs for those PICs for
which sufficient training data are available. For the remaining
approximately 6000 PICs, the adapted PIC models of the
reference speaker are used (as in technique 1). Merging these
PICs results in a model for every legal PIC in the word-pair
grammar.
Table 1: Comparison of recognition results for RM1 speakers using the two methods of speaker training: speaker-dependent
models (SD-PELs) and speaker-dependent respelling of PICs (SD-PICs). Word error rates are reported as percentages for the
RM1 development test data and the Feb91 evaluation data.

Speaker    SD-PELs        SD-PICs        SD-PICs
           Development    Development    Evaluation

BEF        10.5           11             6.3
CMR(f)     6.9            6.8            15.0
DAS(f)     4.3            2.9            1.9
DMS(f)     4.1            3.1            3.6
DTB        7.6            3.6            7.2
DTD(f)     5.6            4.4            7.8
ERS        12.4           10.5           12.6
HXS(f)     3.1            2.5            5.6
JWS        6.3            4.7            4.5
PGH        5.3            5.5            9.1
RKM        13.9           9.8            9.9
TAB        3.6            4.3            5.3

Average    7.0            5.4            7.5

Step 4: A final pass of adaptation consists of resegmenting
the training data into PELs and then re-estimating the
parameters of the speaker-dependent PELs. In the process,
duration distributions are also re-estimated.

The above algorithm to create speaker-dependent PIC
models provides two sets of models with which we have
experimented. The first set is referred to as speaker-dependent
RM models. The second set is the output of the final stage, and
is referred to as the respelled speaker-dependent RM models.
Both sets of speaker-dependent models may contain unchanged
PICs from the original reference speaker when no training
data was available (mainly unchanged duration models,
since most PELs are used in a variety of PICs).

5.  RECOGNITION  EXPERIMENTS 
AND  DISCUSSION 

In this section we present results making use of the two sets
of speaker-dependent models, as well as results on
post-processing with the N-Best algorithm.

5.1 Comparison of two methods for speaker-dependent training

The error rates using each of the training strategies are
shown in Table 1. In this table we display the word error rates
on the 100 development test sentences for each of the 12 RM1
speakers, and we also display the performance of the respelled
models on the Feb91 evaluation data, which consisted of 25
sentences for each speaker.
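
For readers unfamiliar with the metric, a word error rate of this kind is conventionally obtained from a minimum edit-distance alignment between the reference and recognized word strings. The Python sketch below shows that conventional computation; it is not the scoring software used for the benchmark, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Percent word error: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the ref prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # match or substitution
        prev = cur
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, the rate computed this way can exceed 100% for a sufficiently poor hypothesis.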


Table 2: Cumulative percentage of correct sentences on the
choice list using the N-Best algorithm.

Choice #    Cumulative %

1           72
2           83
3           87
4           88
5           90
6           91
7           92
8           92
9           93
10          93
11          93
12          93
13          93
14          93
15          94
Analysis of Errors for Speaker-Dependent Respelled PICs

In the course of our research it has been enlightening to
investigate the errors. We will now focus our discussion on
the performance of the respelled models when recognizing the
development data. The word error rates are seen to range from
a low of 2.5% for speaker HXS to 10.5% for ERS, with an
overall average error rate of 5.4%. When the very same
system is run without the rapid match module, the amount of
computation is vastly increased, but there is only a small
reduction of the observed overall error rate from 5.4% to
5.1%. Roughly 62% of the errors involve function words
only, and the remaining 38% involve a content word (and may
also include a function word error). Function words have an
error rate of 7.6% compared to 2.5% for content words. The
most common content word error is “SPS-40”, which is often
misrecognized as “SPS-48”. Other content word errors often
involve homophones (such as “ships+s” -> “ships”). Function
word deletions are more common than insertions, and
substitutions may be symmetric (“and” -> “in” is as frequent
as “in” -> “and”) or asymmetric (“their” -> “the”, but the
reverse confusion does not occur). Other common errors
involve contractions: “what is” -> “what+s” and “when will”
-> “when+ll”.

Use of alternate pronunciations

Approximately 22% of the lexical entries have alternate
pronunciations. These variants are used to express expected
pronunciation alternations and/or stress differences.

5.2 N-Best Algorithm Test

A recognition pass using an N-Best algorithm was
performed on the development test data. The N-Best algorithm
which we have implemented is similar to the one proposed by
Soong and Huang [4]. It runs as a post-processing step and is
essentially a stack decoder which processes the speech in
reverse time. Computational results saved during the forward
pass are used to provide very close approximations to the best
score of a full transcription which extends a reverse partial
transcription. Although a more complete description of the
algorithm is beyond the scope of the paper, we note that a key
difference between the algorithm we use and that of Soong
and Huang is that we do a full acoustic match in the reverse
pass (i.e., we process the speech data). Also, the reason our
extension scores are only approximate is that in our current
implementation, the forward and reverse acoustic match scores
are different.
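
The general shape of a reverse-time stack decoder guided by forward-pass scores can be sketched in Python on a toy word lattice. This is a deliberate simplification and not the implementation described above: it searches a precomputed lattice rather than performing an acoustic match in the reverse pass, so the forward scores here are exact rather than approximate. The lattice format and the name `n_best` are assumptions.

```python
import heapq

def n_best(edges, start, end, n):
    """Return up to n best (score, words) paths through a word lattice.

    edges: list of (from_node, to_node, word, log_score); nodes are integers
           numbered so that every edge goes from a lower to a higher node.
    """
    # Forward pass: best score from the start node to every node.
    forward = {start: 0.0}
    incoming = {}
    for src, dst, word, score in sorted(edges):
        incoming.setdefault(dst, []).append((src, word, score))
        if src in forward:
            forward[dst] = max(forward.get(dst, float("-inf")), forward[src] + score)
    # Reverse pass: a stack (priority queue) of partial paths growing from the end.
    # Priority = score of the partial path + best forward score to its first node,
    # which is the exact score of the best full path extending it.
    stack = [(-forward[end], 0.0, end, ())]
    results = []
    while stack and len(results) < n:
        neg_total, partial, node, words = heapq.heappop(stack)
        if node == start:
            results.append((-neg_total, list(words)))
            continue
        for src, word, score in incoming.get(node, []):
            if src in forward:
                new_partial = partial + score
                heapq.heappush(stack, (-(forward[src] + new_partial),
                                       new_partial, src, (word,) + words))
    return results
```

Since the priority of each partial path equals the score of its best completion, complete transcriptions come off the stack in order of decreasing score, and the search can stop as soon as N of them have been delivered.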

The test was run on the 1200 utterances from the RM1
development sentences, 100 each from the 12 RM1 speakers.
The parameters controlling the N-Best were set conservatively.
With high confidence, the 100 best alternative sentence
transcriptions were delivered (slowing down the recognition
by about a factor of six). These transcriptions included ones
differing only in placement of internal pauses and/or alternative
pronunciations. If such transcriptions are considered identical,
17 choices were delivered on average. The results given below
do consider such transcriptions as being identical.

The forward algorithm determined the correct transcription
70% of the time, and the N-Best algorithm delivered it as a
choice 94% of the time (almost always as one of the top 15).
That is, for around 80% of the misrecognitions, the correction
was on the choice list. A cumulative count (based on the 1200
test utterances) is given in Table 2. For instance, the correct
transcription was one of the top 5 choices 90% of the time.
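
The arithmetic behind these figures can be made explicit with a short Python sketch. The function names are hypothetical; the first shows how a cumulative count of this kind would be tallied from per-utterance ranks, and the second reproduces the "around 80%" figure from the 70% and 94% quoted above.

```python
def cumulative_percent(ranks, max_choice):
    """ranks: for each utterance, the 1-based position of the correct
    transcription on the choice list, or None if it is absent."""
    total = len(ranks)
    return [100.0 * sum(1 for r in ranks if r is not None and r <= k) / total
            for k in range(1, max_choice + 1)]

def recovered_share(first_pass_pct, on_list_pct):
    """Percent of first-pass misrecognitions whose correction is on the list."""
    return 100.0 * (on_list_pct - first_pass_pct) / (100.0 - first_pass_pct)
```

For example, `recovered_share(70, 94)` gives 80.0: of the 30% of utterances misrecognized by the forward pass, 24 percentage points are recovered somewhere on the choice list.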